Conversation
898865c to
6b8a5a2
Compare
tonyxiao
left a comment
There was a problem hiding this comment.
Review
Overall a high-quality PR — the buildSpecAwareListFn refactor, supportsForwardPagination propagation, StripeApiRequestError/assertOk, and withHttpRetry are all well-done. A few issues worth addressing before merge.
Critical
waitForErrorRecovery has a race with continueAsNew (pipeline-workflow.ts)
desiredStatusSignalCount is transient workflow-local state — it's not included in continueAsNew opts. After a history rollover while in errored state, the new run starts with desiredStatusSignalCount = 0, captures signalCount = 0, then the condition desiredStatusSignalCount > signalCount is immediately 0 > 0 = false. The workflow blocks waiting for a new signal, even if the operator already sent active before the rollover. Recovery should key on the desiredStatus value itself rather than a counter — or use a dedicated acknowledgeErrorSignal that is distinct from desiredStatusSignal.
system_error covers transient 500s, causing false permanent failures (sync-errors.ts)
PERMANENT_FAILURE_TYPES = new Set(['system_error', 'config_error']) — but the source emits failure_type: 'system_error' for any non-rate-limit HTTP error, including transient 500s that survived all withHttpRetry retries. A Stripe infrastructure outage lasting longer than the retry window would permanently error every pipeline and require operator intervention. Consider either narrowing PERMANENT_FAILURE_TYPES to only config_error, or introducing a distinct auth_error type for 401/403 so that system_error can stay transient at the Temporal layer.
Important
Mixed permanent+transient errors in one sync run silently drops the transient errors (pipeline-sync.ts)
If a single run produces both permanent and transient errors, the if (permanent.length > 0) branch returns early — the transient errors are never retried and never surfaced to the caller.
markPermanentError forces phase: 'backfilling' even when called from liveLoop (pipeline-workflow.ts)
When a live-event sync fails permanently, the pipeline may already be in phase: 'ready' (backfill complete). Resetting to 'backfilling' means after recovery the pipeline runs a full re-backfill instead of resuming live processing. markPermanentError should leave phase unchanged — errored: true alone is sufficient to halt processing.
New test suites permanently excluded from CI with no alternative run path (ci.yml)
All four new suites (test-server-all-api, test-server-sync, test-sync-e2e, test-sync-engine) plus test-e2e-network are excluded with no fallback — no nightly job, no label trigger. test-e2e-network in particular covers the recovery scenarios being added in this PR. If they're intentionally local-only, a comment in the YAML explaining why would help; if they're meant to run somewhere, a plan for where would be good.
Minor
requestWithRetryinclient.tssilently skips retry for POST/DELETE — the name implies otherwise. A rename torequestWithGetRetryor a JSDoc noting the GET-only policy would prevent future callers from being surprised.quoteIdentifierinstorage.tsrejects hyphens (^[A-Za-z_][A-Za-z0-9_]*$), which are common in CI Postgres schema names. Since the function already quotes with"...", the validation is overly restrictive without adding security value in a test-only context.test-sync-e2e.test.tssource configs are missingapi_version(required per project conventions).
45a434a to
68ce297
Compare
…p, Revert unrelated dest-postgres changes
…d optimize test server
Summary
Add a local, OpenAPI-driven Stripe list server (
@stripe/sync-test-utils) so tests no longer depend on the remote rate-limited Stripe API. The server runs in-process against Postgres, implements list/retrieve for every endpoint discovered from the spec, and supports auth guards and fault injection.Building the test infrastructure uncovered and fixed bugs in the production code.
Bug fixes
limit,starting_after, andending_beforewere sent to every endpoint regardless of whether the OpenAPI spec advertises them (e.g.reporting_report_types), causing 400 errors. AddedbuildSpecAwareListFnthat filters params based on the spec.has_morewas trusted andstarting_afterwas sent even for endpoints that don't support forward pagination, potentially causing infinite loops or data loss. AddedsupportsForwardPaginationflag derived from the spec.createdfilters ignored — the source hardcodedsupportsCreatedFilter: falsefor all v2 endpoints even when the spec declarescreated, skipping time-range filtering during segmented backfill.createdquery params —buildListFninlistFnResolver.tsnever wiredparams.createdinto v2 request URLs. Now encodescreated[gte]/created[lt]with ISO timestamp conversion.buildListFnandbuildRetrieveFncalledresponse.json()without checkingresponse.ok. A 401 or 500 was parsed as if it were valid data. AddedassertOk+StripeApiRequestErrorto throw explicitly.getAccount, list, and retrieve calls had no retry logic. A single transient 500 from Stripe would fail the entire setup/sync. AddedwithHttpRetry(exponential backoff, retries on 429/5xx/network errors).readyeven when the sync had permanent errors. AddedclassifySyncErrorsto distinguish transient vs permanent failures, with proper error propagation in the Temporal workflow.destination-postgrespool errors could crash the process. Addedpool.on('error', ...)handler.New package:
@stripe/sync-test-utilsOpenAPI-driven local Stripe list/retrieve server backed by Postgres:
createdfilters,has_more/next_page_url)New tests
resourceRegistry,clientretry behavior,listFnResolverpagination flags and error propagation,indexsource setup/backfill flowsOther changes
compose.e2e.yml)objectGenerator.ts) for schema-based test fixture generation